Abstract

Recent work in distance metric learning has produced numerous methods aimed
at learning transformations of data that best align with provided sets of pairwise
similarity and dissimilarity constraints. The learned transformations lead to improved
retrieval, classification, and clustering algorithms due to the more accurate
distance or similarity measures. Here, we introduce the problem of learning these
transformations when the underlying constraint generation process is dynamic.
These dynamics can be due to changes in either the ground-truth labels used to
generate constraints or changes to the feature subspaces in which the class structure
is apparent. We propose and evaluate an adaptive, online algorithm for learning
and tracking metrics as they change over time. We demonstrate the proposed
algorithm on both real and synthetic data sets and show significant performance
improvements relative to previously proposed batch and online distance metric
learning algorithms.


1  Introduction 

The effectiveness of many machine learning and data mining applications relies on
an appropriate measure of pairwise distance between data points that accurately
reflects the objective, e.g., prediction, clustering or classification. In settings with clean,
appropriately-scaled spherical Gaussian data, standard Euclidean distance can be
utilized. However, when the data is heavy tailed, multimodal, or contaminated by outliers,
irrelevant or replicated features, or observation noise, Euclidean inter-point distance
can be problematic, leading to bias or loss of discriminative power.

As a result, many unsupervised, data-driven approaches for identifying appropriate
distances between points have been proposed. These methodologies, broadly taking the
form of dimensionality reduction or data “whitening”, aim to utilize the data itself to
learn a transformation of the data that embeds it into a space where Euclidean distance
is appropriate. Examples of such unsupervised techniques include Principal Component
Analysis [?], Multidimensional Scaling [?], covariance estimation [13, 2], and
manifold learning [?]. Such unsupervised methods do not have the benefit of human
input on the distance metric, and rely heavily on prior assumptions, e.g., local linearity
or smoothness.

This paper proposes methods for distance metric learning. In this problem one
seeks to learn linear transformations of the data that are well matched to a particular
task specified by the user. In this case, point labels or constraints indicating point
similarity or dissimilarity are used to learn a transformation of the data such that similar
points are “close” to one another and dissimilar points are distant in the transformed
space. Learning distance metrics in this manner allows a more precise notion of
distance or similarity to be defined that is related to the task at hand.

Many supervised and semi-supervised distance metric learning approaches have
been developed [?]. This includes online algorithms [18] with regret guarantees for
situations where similarity constraints are received in a stream. In this paper, we
propose a new way of formulating the distance metric learning task. We assume the
underlying ground-truth distance metric from which constraints are generated is evolving
over time. This problem formulation suggests an adaptive, online approach to track
the underlying metric as constraints are received. We present an algorithm for
tracking distance metrics based on recent advances in composite objective mirror descent
for metric learning [10] (COMID) and the Strongly Adaptive Online Learning (SAOL)
framework proposed in [7].

1.1  Related  Work 

Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are
classic examples of linear transformations for projecting data into more interpretable
low dimensional spaces. Unsupervised PCA seeks to identify a set of axes that best
explain the variance contained in the data. LDA takes a supervised approach, minimizing
the intra-class variance and maximizing the inter-class variance given class labeled
data points.

Much of the recent work in Distance Metric Learning has focused on learning
Mahalanobis distances on the basis of pairwise similarity/dissimilarity constraints. These
methods have the same goals as LDA; pairs of points labeled “similar” should be
close to one another while pairs labeled “dissimilar” should be distant. MMC [25],
a method for identifying a Mahalanobis metric for clustering with side information,
uses semidefinite programming to identify a metric that maximizes the sum of
distances between points labeled with different classes subject to the constraint that the
sum of distances between all points with similar labels be less than some constant.

Large Margin Nearest Neighbor (LMNN) [23] similarly uses semidefinite programming
to identify a Mahalanobis distance; however, it modifies the constraints to only
take into account a small, local neighborhood for each point. In this setting, the
algorithm minimizes the sum of distances between a given point and its similarly labeled
neighbors while forcing differently labeled neighbors outside of its neighborhood. This
method has been shown to be computationally efficient [24] and, in contrast to the
similarly motivated Neighborhood Component Analysis [?], is guaranteed to converge to
a globally optimal solution. Additionally, constraining the optimization based only on
a small neighborhood of points enables effective processing of multi-modal classes.

Information Theoretic Metric Learning (ITML) [8] is another popular Distance
Metric Learning technique. ITML minimizes the Kullback-Leibler divergence between
an initial guess of the matrix that parameterizes the Mahalanobis distance and a solution
that satisfies a set of constraints. The constraints in this setting are based on similarity
and dissimilarity pairs and are constructed such that similar pairs be within some
closeness constant and dissimilar pairs be more distant than some larger constant. Online
and non-linear extensions to the ITML methodology are presented as well.

In a dynamic environment, it is necessary to be able to compute multiple estimates
of the changing metric at different times, and to be able to compute those estimates
online. Online learning [5] meets these criteria by efficiently updating the estimate
every time a new data point is obtained, instead of solving an objective function formed
from the entire dataset.

Many online learning methods have regret guarantees, that is, the loss in
performance relative to a batch method is provably small [?]. In practice, however, the
performance of an online learning method is strongly influenced by the learning rate,
which may need to vary over time in a dynamic environment [?].

Adaptive online learning methods attempt to address this problem by continuously
updating the learning rate as new observations become available. For example,
AdaGrad-style methods [?, 9] perform gradient descent steps with the step size
adapted based on the magnitude of recent gradients. Follow the regularized leader
(FTRL) type algorithms adapt the regularization to the observations [20]. Recently, a
method called Strongly Adaptive Online Learning (SAOL) has been proposed, which
maintains several learners with different learning rates and selects the best one based
on recent performance [7]. Several of these adaptive methods have provable regret
bounds [?]. These typically guarantee low total regret (i.e. regret from time 0
to time t) at every time [20]. SAOL, on the other hand, is guaranteed to have low regret
on every subinterval, as well as low regret overall [7].

The remainder of this paper is structured as follows. In Section 2 we formalize the
distance metric tracking problem, and Section 3 reviews the existing COMID learning
framework. Section 4 introduces our adaptive approaches to solving the distance metric
tracking problem, and Section 5 presents our Strongly Adaptive Online Metric Learning
algorithm. Results on both synthetic data and a text review dataset are presented in
Section 6, with discussion and future work presented in Section 7.

2  Problem  Formulation 

The  goal  of  this  work  is  to  use  analyst  feedback  to  learn  a  metric  on  the  data  space 
that  best  matches  the  goals  of  the  analyst.  We  formulate  the  problem  as  a  cooperative 
dynamic  game  between  the  learner  and  the  analyst.  Both  players’  goal  is  for  the  learner 
to  learn  the  internal  metric  M  used  by  the  analyst.  The  metric  is  changing  over  time, 
making  the  game  dynamic. 

The analyst selects pairs of data points $(x_t, z_t)$ and labels them as similar or
dissimilar. The labels are assumed to arrive in a temporal sequence, hence the labels at the
beginning may have arisen from a different metric than those at the end of the sequence.

In sum, the learning goals include tracking the analyst’s internal metric in the
presence of metric changes and noise, and (equivalently) finding an embedding which
results in maximal separation of the clusters of interest to the analyst, enabling better
interpretation and/or future feedback from the analyst. Potential extensions which we
do not have the space to treat here include exploiting unlabeled data points [?], and/or
choosing which pairs or groups of pairs to present to the analyst (i.e. active learning [?]).

2.1  Objective  function 

Metric learning seeks to learn a metric that encourages data points marked as similar to
be close and data points marked as different to be far apart. The Mahalanobis distance
is parameterized by $M$ as

$$d_M^2(x, z) = (x - z)^T M (x - z) \qquad (1)$$

where $M \in \mathbb{R}^{n \times n}$ and $M \succeq 0$.
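As a concrete illustration of (1), the following is a minimal sketch assuming NumPy arrays for the points and the matrix $M$. The function names `mahalanobis_sq` and `is_psd` are hypothetical helpers introduced here, not part of the original method.

```python
import numpy as np

def mahalanobis_sq(x, z, M):
    """Squared Mahalanobis distance (x - z)^T M (x - z)."""
    u = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(u @ np.asarray(M, dtype=float) @ u)

def is_psd(M, tol=1e-10):
    """Check that M is symmetric with eigenvalues >= -tol (hypothetical helper)."""
    M = np.asarray(M, dtype=float)
    return bool(np.allclose(M, M.T) and np.linalg.eigvalsh(M).min() >= -tol)
```

With $M$ equal to the identity, this reduces to the squared Euclidean distance; a low-rank $M$ ignores directions in its null space.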

Suppose a set of similarity constraints is given, where each constraint is the triplet
$(x_t, z_t, y_t)$, $x_t$ and $z_t$ are data points in $\mathbb{R}^n$, and the label $y_t = +1$ if the points $x_t, z_t$
are similar and $y_t = -1$ if they are dissimilar.

Following [?], we introduce the following margin based constraints:

$$d_M^2(x_t, z_t) \le \mu - 1, \quad \forall \{t \mid y_t = 1\} \qquad (2)$$
$$d_M^2(x_t, z_t) \ge \mu + 1, \quad \forall \{t \mid y_t = -1\}$$

where $\mu$ is a threshold that controls the margin between similar and dissimilar points.
A diagram illustrating these constraints and their effect is shown in Figure 1.

In typical fashion, these constraints are softened by penalizing violation of the
constraints with a convex loss function $\ell_t$. This gives the following objective:

$$\min_{M \succeq 0,\ \mu \ge 1} \ \frac{1}{T} \sum_{t=1}^{T} \ell_t(M, \mu) + \rho\, r(M) \qquad (3)$$
$$\ell_t(M, \mu) = \ell(m_t), \quad m_t = y_t(\mu - u_t^T M u_t), \quad u_t = x_t - z_t$$

where $r$ is the regularizer. Kunapuli and Shavlik propose using nuclear norm
regularization ($r(M) = \|M\|_*$) to encourage projection of the data onto a low dimensional
subspace (feature selection/dimensionality reduction).
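To make the objective concrete, the sketch below evaluates the margin $m_t$ and the regularized average loss for a list of constraints. It assumes the hinge loss $\ell(m) = \max(0, 1 - m)$ as one possible convex choice, since the text leaves $\ell$ generic; the function names are hypothetical.

```python
import numpy as np

def margin(x, z, y, M, mu):
    """m_t = y_t * (mu - u_t^T M u_t) with u_t = x_t - z_t."""
    u = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(y * (mu - u @ M @ u))

def hinge(m):
    """Hinge loss (an assumed choice): zero once the margin satisfies m >= 1."""
    return max(0.0, 1.0 - m)

def objective(constraints, M, mu, rho):
    """(1/T) * sum_t hinge(m_t) + rho * ||M||_* over (x, z, y) triplets."""
    avg = np.mean([hinge(margin(x, z, y, M, mu)) for x, z, y in constraints])
    return float(avg + rho * np.linalg.norm(M, ord="nuc"))
```

A similar pair at squared distance 1 with $\mu = 3$ satisfies its constraint and contributes no loss, whereas the same pair labeled dissimilar violates its constraint and is penalized.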

Figure 1: Visualization of the margin based constraints (2), with colors indicating class.
The goal of the metric learning constraints is to move target neighbors towards the
point of interest (POI), while moving points from other classes away from the target
neighborhood.

3 Composite Objective Mirror Descent

One principled approach to online learning involves viewing the acquisition of new
data points as stochastic realizations of the underlying distribution, suggesting the use
of stochastic mirror descent techniques. The authors of [18] propose a composite
objective mirror descent (COMID) approach to online metric learning that solves a
regularized positive semidefinite learning problem.

Using the COMID framework [10], for the objective (3) we have online learning
updates that iterate through the constraints

$$M_{t+1} = \arg\min_{M \succeq 0} \ B_\psi(M, M_t) + \eta_t \langle \nabla_M \ell_t(M_t, \mu_t),\ M - M_t \rangle + \eta_t \rho \|M\|_* \qquad (4)$$
$$\mu_{t+1} = \arg\min_{\mu \ge 1} \ B_\psi(\mu, \mu_t) + \eta_t \nabla_\mu \ell_t(M_t, \mu_t)\,(\mu - \mu_t),$$

where $B_\psi$ is any Bregman divergence and $\eta_t$ is the learning rate parameter. $M_0, \mu_0$
are initialized to some initial value. In [18] a closed-form algorithm for solving the
minimization in (4) is developed for a variety of common losses and Bregman
divergences, involving rank one updates and eigenvalue shrinkage. A kernel version of the
algorithm is also available for the batch case.
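For intuition about the "rank one updates and eigenvalue shrinkage" structure, the following is a minimal sketch of a single update under two stated assumptions that are not fixed by the text: the hinge loss $\ell(m) = \max(0, 1 - m)$ and the squared Euclidean (Frobenius) Bregman divergence. Under these assumptions the update reduces to a gradient step followed by soft-thresholding of the eigenvalues. The name `comid_step` is hypothetical, and this is not the closed-form algorithm of the cited work.

```python
import numpy as np

def comid_step(M, mu, x, z, y, eta, rho):
    """One composite mirror descent step, assuming hinge loss and the
    divergence B(a, b) = 0.5 * ||a - b||^2 (both are assumptions)."""
    u = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    m = y * (mu - u @ M @ u)
    if m < 1.0:
        # Constraint violated: rank-one subgradient in M, scalar one in mu.
        grad_M, grad_mu = y * np.outer(u, u), -float(y)
    else:
        grad_M, grad_mu = np.zeros_like(M), 0.0
    # Gradient step, then shrink eigenvalues by eta * rho and clip at zero.
    # For PSD matrices the nuclear norm equals the trace, so this solves the
    # regularized step and keeps the iterate PSD.
    vals, vecs = np.linalg.eigh(M - eta * grad_M)
    vals = np.maximum(vals - eta * rho, 0.0)
    M_new = (vecs * vals) @ vecs.T
    mu_new = max(1.0, mu - eta * grad_mu)  # projection onto mu >= 1
    return M_new, mu_new
```

The eigenvalue clipping is what drives small eigenvalues exactly to zero, which is how the nuclear norm term encourages low-rank (low-dimensional) embeddings.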

By standard mirror descent analysis, this method has $O(\sqrt{T})$ regret for the static
case when the learning rate is set as $\eta_t = \eta/\sqrt{t}$. For online learning of a static
objective, the learning rate will decay to zero. However, in the case of a dynamic objective,
the learning rate must not decay to zero so that the estimate of the parameters will
be most strongly influenced by the recent constraint history. This was proposed in a
generic online learning scenario in [12], where low regret guarantees were derived,
which we extend to metric learning in the supplementary material. Critically, the
optimal learning rate depends on how fast the objective is changing. We propose two
methods for addressing this issue and for learning distance metrics that change over
time in an arbitrary way.

4 Dynamic Metric Learning Algorithms

4.1  Windowed  Batch  Approach 

Intuitively, if the underlying metric is changing smoothly, the most recent samples are
the most relevant. Similarly to the covariance estimation method of [26], it is possible
to apply batch methods to learn a changing metric. At any given time, past samples
are weighted by their recency, and a batch method is used to estimate
the current metric. This is then repeated at various times, giving in effect a weighted
sliding window of samples from which to learn. The resulting objective is

$$\min_{M_t \succeq 0,\ \mu_t \ge 1} \ \sum_{k=1}^{K} a_k\, \ell_{t-K+k}(M_t, \mu_t) + \rho\, r(M_t) \qquad (5)$$

where $a_k$ is such that $\sum_{k=1}^{K} a_k = 1$. This function is convex. Nonrectangular windows
can be more difficult computationally, and repeated batch processing is not efficient.

In our experiments, we use COMID to solve the objective (5) at each step. Since
from $t$ to $t + 1$ the objective function only changes slightly if $K$ is large enough and $a$
is sufficiently smooth, computational complexity is reduced by initializing the current
update with the previous estimate.
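The windowed objective can be sketched as follows. The exponential shape of the weights and the `decay` parameter are illustrative assumptions; the text only requires that the weights sum to one and favor recent samples.

```python
import numpy as np

def recency_weights(K, decay=0.9):
    """Weights a_1..a_K summing to one, largest for the newest sample (k = K).
    The exponential window is an assumed example of a nonrectangular window."""
    a = decay ** np.arange(K - 1, -1, -1)
    return a / a.sum()

def windowed_objective(losses, reg, rho, decay=0.9):
    """sum_k a_k * loss_{t-K+k} + rho * r(M_t), where `losses` holds the K most
    recent per-constraint losses (oldest first) evaluated at the candidate
    (M_t, mu_t), and `reg` is the regularizer value r(M_t)."""
    a = recency_weights(len(losses), decay)
    return float(a @ np.asarray(losses, dtype=float) + rho * reg)
```

Setting `decay=1.0` recovers a rectangular window in which all K samples count equally.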

4.2  Adaptive  Online  Approach 

In an online learning scenario where drift is occurring, as noted above, the choice of the
learning rate $\eta_t$ can be critical. Furthermore, if discrete shifts and/or changes in drift
occur, the optimal $\eta_t$ may change with time, and setting a drift rate dependent $\eta_t$ using
cross validation is not practical in a truly online setting. Hence, a method of adaptively
choosing the learning rate in an online fashion is desirable.


Figure 2: Strongly Adaptive Online Learning - Learners at multiple scales run in
parallel. Observed losses for each are used to create weights that are used to select the
current scale.

Figure 3: Stochastic mirror descent learners and initialization. Each yellow and red
learner is initialized by the output of the previous learner of the same color, that is, the
learner of the next shorter scale.


While any method of adaptively setting $\eta_t$ may be used, in this work we chose the
Strongly Adaptive Online Learning (SAOL) framework of [7] because of its ability to
perform well on every time subinterval.

SAOL proposes running a bank of multiple online base learners in parallel, each
having parameters optimized for learning on an interval of a different length, or
alternatively in our case, for learning a metric that has its drift spread out over an interval
of a different length. SAOL then uses the recent history of losses suffered by each
learner to select the learner that is most accurate at the current time (Figure 2). SAOL
has strong theoretical guarantees on the regret on every subinterval, as opposed to the
traditional bounds on regret over the entire learning period. This guarantees that the
estimate will be sufficiently responsive to make it accurate at all times.

We use the COMID online learners of Section 3 (with learning rate $\eta_t$) as the base
learners. Algorithm 1 shows the SAOL algorithm applied to the metric learning
problem. The next section explains SAOL and its implementation.

5  SAOML 

5.1  SAOL  Framework 

We first describe the SAOL framework of [7]. SAOL is based on dyadically partitioning
the temporal axis into intervals and assigning a black box learner to each interval.
Specifically, define a set $\mathcal{I}$ of intervals $I = [t_{I1}, t_{I2}]$ such that the lengths $|I|$ of the
intervals are proportional to powers of two, i.e. $|I| = I_0 2^j$, with an arrangement that is
a dyadic partition of the temporal axis. The first interval of length $|I|$ starts at $t = |I|$
(see Figure 2), and additional intervals of length $|I|$ exist such that the rest of time is
covered.

Every interval $I$ is associated with a base learner that operates on that interval.
Hence, at a given time $t$, a set $\mathrm{ACTIVE}(t) \subset \mathcal{I}$ of $\lfloor \log_2 t \rfloor$ intervals/learners are
active, running in parallel. The base learner of a given interval $I$ is designed to have low
total regret ($O(\sqrt{|I|})$) on that interval in the static case. Because the parameters being
learned are changing with time, learners designed for low regret at different scales
will have different performance (analogous to the classic bias/variance tradeoff). In
other words, there is an optimal scale $|I|$ that can be selected from the base learning
ensemble.
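The dyadic arrangement can be sketched as below, under the assumption that intervals of length `base_len * 2**j` tile the time axis and that the first interval of each length starts at a time equal to that length. The function name and the exact indexing convention are assumptions, so the number of active intervals returned may differ by one from the count stated above.

```python
def active_intervals(t, base_len=1):
    """Return the (start, end) interval containing time t at each dyadic scale.

    Assumes intervals of length base_len * 2**j tile the time axis, with the
    first interval of each length starting at t = length (an assumed convention)."""
    intervals = []
    length = base_len
    while length <= t:
        start = (t // length) * length
        intervals.append((start, start + length - 1))
        length *= 2
    return intervals
```

The number of active learners grows only logarithmically in $t$, which keeps the cost of running the ensemble in parallel modest.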
Algorithm 1 Strongly Adaptive Online Metric Learning
1: Initialize: $w_1(I)$
2: for $t = 1$ to $T$ do
3:   Initialize new learner if needed.
4:   Choose $\hat{I} \in \mathrm{ACTIVE}(t)$ according to (10).
5:   Mirror Descent update (4) for all active learners.
6:   Set $\hat{M}_t \leftarrow M_t(\hat{I})$, $\hat{\mu}_t \leftarrow \mu_t(\hat{I})$
7:   Obtain constraint $(x_t, z_t, y_t)$, compute loss $\ell_{t,\log}(\cdot)$.
8:   Update weights for all $I$ with $t \in I$:
       $r_t(I) = \left( \sum_{I'} \frac{w_t(I')}{\sum_{I''} w_t(I'')}\, \ell_{t,\log}(M_t(I'), \mu_t(I')) \right) - \ell_{t,\log}(M_t(I), \mu_t(I))$
       $w_{t+1}(I) = w_t(I)\,(1 + \min\{1/2, 1/\sqrt{|I|}\}\, r_t(I))$
9: end for
10: Return $\{\hat{M}_t, \hat{\mu}_t\}$.


It remains to select the output of one of the active learners and use it as the final
estimate at any given time $t$. In [7], it is proposed to compute weights for each learner.
These weights are updated based on the learner’s recent estimated regret, which is
estimated as described below, and are used to randomly select a learner. In our work,
we update the weights according to

$$w_{t+1}(I) = w_t(I)\,(1 + \eta_I r_t(I)), \quad \forall t \in I \qquad (6)$$
$$r_t(I) = \ell_t(\hat{M}_t, \hat{\mu}_t) - \ell_t(M_t(I), \mu_t(I))$$

for all $I \in \mathcal{I}$, where $\eta_I = \min\{1/2, 1/\sqrt{|I|}\}$, $M_t(I), \mu_t(I)$ are the outputs at
time $t$ of the learner on interval $I$, and $r_t(I)$ is called the estimated regret of the learner
on interval $I$ at time $t$. Essentially, this assigns high weight to low loss learners and low
weight to high loss learners.

For any given time $t$, the output of the learner of interval $I \in \mathrm{ACTIVE}(t)$ is
randomly selected as the output of the SAOL learner with probability

$$\Pr(\hat{M}_t = M_t(I),\ \hat{\mu}_t = \mu_t(I)) = \frac{w_t(I)}{\sum_{I' \in \mathrm{ACTIVE}(t)} w_t(I')}, \quad \forall I \in \mathrm{ACTIVE}(t). \qquad (7)$$

In [7], SAOL assumes that the loss $\ell(\cdot)$ lies between 0 and 1. We propose a way to
apply this to our unbounded loss in the next subsection.
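The weight update and the selection probabilities can be sketched in a few lines. The dictionary-based representation of the active learners and the function names are assumptions made for illustration.

```python
import math

def update_weight(w, r, interval_len):
    """w_{t+1}(I) = w_t(I) * (1 + eta_I * r_t(I)), eta_I = min(1/2, 1/sqrt(|I|))."""
    eta = min(0.5, 1.0 / math.sqrt(interval_len))
    return w * (1.0 + eta * r)

def selection_probs(weights):
    """Normalize a {interval: weight} mapping into selection probabilities."""
    total = sum(weights.values())
    return {interval: w / total for interval, w in weights.items()}
```

Because the estimated regret is bounded in magnitude by one when the loss lies in [0, 1] and the step size is at most 1/2, the weights stay positive after every update.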
5.2 Implementation


We note that [7] does not provide any further implementation details, and that selecting
a learner at random can be problematic.

For stochastic mirror descent learners, we propose the following approach. Let
each learner be a stochastic composite mirror descent learner (4) having a constant
learning rate proportional to the inverse square root of the length of the interval, i.e.
$\eta_t(I) = \eta_0/\sqrt{|I|}$.

Each mirror descent learner (besides the coarsest) at level $j$ ($|I| = I_0 2^j$) is
initialized to the current estimate of the next coarsest learner (level $j - 1$). Furthermore,
the weight $w_t$ is carried over from said coarser learner. This strategy is equivalent
to “backdating” the interval learners so as to ensure appropriate convergence has
occurred before the interval of interest is reached, and is effectively a “quantized square
root decay” of the learning rate (Figure 3).

In the SAOL framework, the loss must lie between 0 and 1. For convexity reasons,
the loss function we use in (3) is unbounded. However, this is a relaxation of the
underlying 0-1 loss. Hence for purposes of updating the weights, we use the logistic
loss

$$\ell_{t,\log}(M_t, \mu_t) = \mathrm{logistic}\left(-\frac{c\, m_t}{\mu_t}\right) \qquad (8)$$

where the argument is scaled by $\mu_t$ because only the relative scale of $\mu_t$ and $M_t$ is
relevant to the similar/dissimilar boundary. The constant $c$ scales the “buffer region”
created by the loss. We set $c = 2$ in all our experiments. Incorporating the logistic loss
into (6),


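$$r_t(I) = \left( \sum_{I' \in \mathrm{ACTIVE}(t)} \frac{w_t(I')}{\sum_{I'' \in \mathrm{ACTIVE}(t)} w_t(I'')}\, \ell_{t,\log}(M_t(I'), \mu_t(I')) \right) - \ell_{t,\log}(M_t(I), \mu_t(I)) \qquad (9)$$
$$w_{t+1}(I) = w_t(I)\,(1 + \eta_I r_t(I)), \quad \forall t \in I.$$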
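The bounded loss and the estimated regret can be sketched as follows. The exact argument of the logistic function is an assumption here: the sketch uses the margin $m_t$ divided by $\mu_t$ and scaled by $c$, negated so that violated constraints give losses near one. The function names are hypothetical.

```python
import math

def logistic_loss(m, mu, c=2.0):
    """Bounded surrogate in (0, 1), assumed to be logistic(-c * m / mu).
    The argument is clipped to avoid overflow in exp for extreme margins."""
    z = max(-500.0, min(500.0, c * m / mu))
    return 1.0 / (1.0 + math.exp(z))

def estimated_regret(losses, weights):
    """r_t(I) = (weight-averaged loss over active learners) - loss of learner I.
    `losses` and `weights` are {interval: value} mappings (assumed layout)."""
    total = sum(weights[I] for I in losses)
    mean_loss = sum(weights[I] * losses[I] for I in losses) / total
    return {I: mean_loss - losses[I] for I in losses}
```

A learner whose loss is below the weighted average receives a positive estimated regret and therefore gains weight in the subsequent update.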


In the original SAOL framework, the current estimates are selected randomly.
While this gives useful bounds on the expected regret, it means that a known poor
estimate is chosen with nonzero probability. We instead propose the following: Choose
the $I$ that minimizes the expected total Bregman divergence.

$$\hat{I}(t) = \arg\min_{I \in \mathrm{ACTIVE}(t)} \ \sum_{I' \in \mathrm{ACTIVE}(t)} \frac{w_t(I')}{\sum_{I''} w_t(I'')}\, B_\psi(M_t(I), M_t(I')) \qquad (10)$$

If $B_\psi$ is the Frobenius norm, then this is equivalent to choosing the estimate closest to
the expectation.
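A minimal sketch of this deterministic selection rule follows, assuming the squared Frobenius distance as the divergence. The dictionary layout and the name `select_learner` are assumptions for illustration.

```python
import numpy as np

def select_learner(estimates, weights):
    """Return the key of the active learner whose estimate minimizes the
    weight-averaged squared Frobenius distance to all active estimates."""
    keys = list(estimates)
    p = np.array([weights[k] for k in keys], dtype=float)
    p /= p.sum()

    def expected_div(k):
        return sum(pi * np.sum((estimates[k] - estimates[j]) ** 2)
                   for pi, j in zip(p, keys))

    return min(keys, key=expected_div)
```

With equal weights the rule picks the estimate nearest the plain average; as one learner's weight grows, the choice moves toward that learner.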


5.3  Performance  Guarantees 

In the game theory literature, learning rates for stochastic mirror descent techniques
have been developed to be able to play dynamic games, i.e., to solve optimization
problems that are changing over time [?, ?, 16].

In this section, we parameterize our convex loss as $f_t(\theta_t) = \ell_t(\theta_t) + \rho\, r(\theta_t)$, where
$\theta = [M, \mu]$. Since the optimal parameter value is changing in a dynamic environment,
defining the static regret of an algorithm $B$ on an interval $I$ as

$$R_{B,\mathrm{static}}(I) = \sum_{t \in I} f_t(\hat{\theta}_t) - \min_{\theta} \sum_{t \in I} f_t(\theta) \qquad (11)$$

where $\hat{\theta}_t$ denotes the output of $B$ at time $t$, is not useful.

A more useful generalization of the standard static regret is as follows. Let $W$ be a
possible set of actions, in this case the set of possible sequences $w = \{\theta_t\}_{t \in I}$ satisfying
some criterion. This allows for a dynamically changing estimate. Then, the dynamic
regret of an algorithm $B$ is defined as

$$R_B(I) = \sum_{t \in I} f_t(\hat{\theta}_t) - \min_{w \in W} \sum_{t \in I} f_t(\theta_t) \qquad (12)$$

In [12] the authors define dynamic regret by setting $W = \{w \mid \sum_{t \in I} \|\theta_{t+1} - \theta_t\| \le \gamma\}$,
i.e. bounding the total amount of variation in the estimated parameter. Without
temporal regularization, minimizing the loss would cause $\theta_t$ to grossly overfit, hence
the constraint on how fast $\theta_t$ can change.
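The variation budget that defines the comparator class can be checked numerically, as in the sketch below. It assumes each parameter $\theta_t$ is flattened into a vector and that the Euclidean norm is used; the choice of norm and the function names are assumptions.

```python
import numpy as np

def path_variation(thetas):
    """Total variation sum_t ||theta_{t+1} - theta_t|| of a parameter sequence."""
    flat = [np.asarray(th, dtype=float).ravel() for th in thetas]
    return float(sum(np.linalg.norm(b - a) for a, b in zip(flat, flat[1:])))

def in_comparator_class(thetas, gamma):
    """True if the sequence respects the variation budget gamma."""
    return path_variation(thetas) <= gamma
```

A static comparator has zero variation, so setting the budget to zero recovers the static regret as a special case.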

We now use this notion of dynamic regret and extend it to the stronger notion
of strongly adaptive regret. Following [7], we define strongly adaptive regret of an
algorithm $A$ as

$$\text{SA-Regret}_T^{A}(\tau) = \max_{I = [q,\, q + \tau - 1] \subset [0, T]} \mathbb{E}[R_A(I)] \qquad (13)$$

where the expectation is with respect to the possibly random output of the algorithm.
We call an algorithm strongly adaptive if $\text{SA-Regret}_T^{A}(\tau) = O(\mathrm{poly}(\log T)\, R_{\mathcal{P}}(\tau))$,
where $R_{\mathcal{P}}(\tau)$ is the regret of the learning problem, i.e. the best possible regret bound.

Low  strongly  adaptive  regret  implies  that  the  dynamic  regret  is  low  on  every  subin¬ 
terval,  instead  of  only  low  in  the  aggregate.  As  a  result,  a  strongly  adaptive  algorithm 
must  quickly  adapt  to  changes,  otherwise  the  subintervals  immediately  following  the 
change  will  not  have  low  regret,  even  if  the  total  regret  over  all  time  is  low. 

In the supplementary material, we prove the following:

Theorem 1 (SAOML). Let $W = \{w \mid \sum_{t} \|\theta_{t+1} - \theta_t\| \le \gamma\}$ and $B$ be the COMID
algorithm of (4) with $\eta_t(I) = \eta_0/\sqrt{|I|}$ and fixed $\rho$. Then the strongly adaptive online
learner $\mathrm{SAOL}^B$ using $B$ as the black box learners satisfies

$$R_{\mathrm{SAOL}^B}(I) \le \frac{4}{2^{1/2} - 1}\, C (1 + \gamma) |I|^{1/2} + 40 \log(s + 1) |I|^{1/2} \qquad (14)$$

for some constant $C$ and every interval $I = [q, s]$. In particular, $\mathrm{SAOL}^B$ is strongly
adaptive.
6 Results


6.1  Synthetic  Data 

We run our metric learning algorithms on synthetic datasets undergoing different types
of simulated metric drift. The first dataset we consider has three classes, with a
50-20-30% split of the prior probability. Each class is associated with a Gaussian blob in
3-dimensional space, with each class having a different mean and covariance. For each
of 2000 data points, we select a class at random and generate a 3-dimensional point
from that class’s Gaussian distribution. We then embed the 3-dimensional dataset in a
random subspace of a 25-dimensional space. The remaining 22-dimensional subspace
is filled with iid Gaussian noise.
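A dataset of this form can be generated along the following lines. This is a minimal sketch: the specific class means, covariances, and noise scale are drawn at random here and are assumptions, since the text does not state the values used in the experiments.

```python
import numpy as np

def make_dataset(n_points=2000, dim=25, priors=(0.5, 0.2, 0.3), seed=0):
    """Three Gaussian classes in a random 3-D subspace of R^dim, with iid
    Gaussian noise in the orthogonal complement. Means and covariances are
    random placeholders (assumptions), not the values used in the paper."""
    rng = np.random.default_rng(seed)
    means = rng.normal(scale=3.0, size=(3, 3))
    factors = rng.normal(size=(3, 3, 3))  # class covariance = A_c @ A_c.T
    labels = rng.choice(3, size=n_points, p=priors)
    latent = rng.normal(size=(n_points, 3))
    signal = means[labels] + np.einsum("nij,nj->ni", factors[labels], latent)
    # Random orthonormal basis: first 3 columns span the signal subspace,
    # the remaining dim - 3 columns carry pure noise.
    Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    noise = rng.normal(size=(n_points, dim - 3))
    X = signal @ Q[:, :3].T + noise @ Q[:, 3:].T
    return X, labels
```

Discrete drift of the kind studied later could then be simulated by permuting the columns of `X` partway through the constraint stream.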

We generate a series of $T$ constraints from random pairs of points in the dataset,
incorporating simulated drift (described below), running each experiment with 1000
random trials. For each experiment conducted in this section, we evaluate performance
using three metrics. First, we show the data points in the first two dimensions (as
determined by the SVD of $M_T$) of the final learned embedding, color coded according to
their true classes. Second, we plot the k-nearest neighbor (k-NN) error rate, using the
learned embedding at each time point, averaging over all trials. Third, we quantify the
clustering performance by plotting the empirical probability that the normalized mutual
information (NMI) of the k-means clustering of the unlabeled data points in the learned
embedding at each time point exceeds 0.85 (out of a possible 1). We believe clustering NMI,
rather than k-NN performance, is a more realistic indicator of metric learning performance,
at least in the case where finding a relevant embedding is the primary goal.


Figure 4: Dataset 1. The dataset remains fixed throughout; no drift occurs as the
constraints are observed. Shown as a function of time is the mean k-NN error rate and the
probability the k-means NMI > 0.85. Note the failure of ITML and similar
performance of the remaining online and batch methods.

Figure 4 shows the static drift-free results for nonadaptive COMID, SAOML, LMNN
(batch), our weighted batch method, and online ITML. All parameters were set via
cross validation and remain constant through all experiments on the dataset. Online
ITML fails due to its bias against low-rank solutions [?], and the other methods perform
comparably as there is no drift. Discrete drift, where at time $T/2$ the 25 dimensions are
randomly permuted, is shown in Figure 5, and continuous drift with a changing rate is
shown in Figure 6. To simulate continuous drift, at each time step we perform a small
random rotation of the dataset, and at time $T/2$ the rate of rotation is increased by a
factor of 6. It can be seen that the weighted batch and especially SAOML respond
quickly to drift, performing significantly better than the nonadaptive COMID.



Figure  5:  Dataset  1.  Random  permutation  of  the  data  dimensions  occurs  at  the  halfway 
point.  Top  to  bottom:  Nonadaptive  COMID,  adaptive  SAOML,  and  the  windowed 
batch  method.  Note  the  slow  recovery  of  the  nonadaptive  method  after  the  change. 

A second dataset identical to the one described above, except with an alternative
generative cluster model, was also used. For each of the data points, we assign two classes
(corresponding to different possible partitions A and B of the data), both selected at
random, and for both generate a 3-dimensional point from that class’s Gaussian
distribution. The two points are then concatenated into a single 6-dimensional point. We
then embed the entire 6-dimensional dataset in a random subspace, with the remaining
dimensions filled with iid Gaussian noise as before.

The results for no drift are shown in Figure 7 and are similar to those found with the first
dataset. We also consider drift between partitions (Figure 8): At first, partition A is
used, and at time $T/2$, the labeling is changed to partition B. By way of interpretation,
the goal of metric learning is to identify the 3-dimensional subspace corresponding
to the labeling of interest, and project away the noisy subspaces, thus improving the
performance of secondary algorithms. The nonadaptive method fails to quickly catch
up to the shift, whereas SAOML effectively increases the learning rate parameter to
quickly learn the new paradigm.

6.2  Clustering  Product  Reviews 

As an example real data task, we consider clustering Amazon text reviews, using the
Multi-Domain Sentiment Dataset. We use the 11402 reviews from the Electronics
and Books categories, and preprocess the data by computing, for each review, word
counts over 2369 commonly occurring words. Two possible clusterings of the reviews
are considered: product category (books or electronics) and sentiment (positive: star
rating 4/5 or greater, or negative: 2/5 or less).

Figure 6: Dataset 1. Continuous slow rotational drift of the dataset occurs, followed
by more rapid drift. From top to bottom: Nonadaptive COMID, SAOML, and the
weighted batch method. The nonadaptive method with its fixed learning rate performs
poorly during rapid drift relative to the adaptive methods.

Figures 9 and 10 show the first two dimensions of the embeddings learned by static
COMID for the category and sentiment clusterings, respectively. Also shown are the
2-dimensional standard PCA embeddings, and the k-NN classification performance both
before embedding and in each embedding. As expected, metric learning is able to find
embeddings with improved class separability. We emphasize that while improvements
in k-NN classification are observed, we use k-NN merely as a way to quantify the
separability of the classes in the learned embeddings. In these experiments, we set the
regularizer r(·) to the L1 norm.
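Since k-NN error is used here only as a separability score, a simple leave-one-out (LOO) estimate suffices. The following is a minimal sketch of such a score; the value of k, the majority-vote tie-breaking, and the function name are assumptions, as the text does not specify them.

```python
import numpy as np

def loo_knn_error(X, y, k=5):
    """Leave-one-out k-NN error rate of labels y for points X (rows)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances.
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(D, np.inf)            # a point may not vote for itself
    neighbors = np.argsort(D, axis=1)[:, :k]
    errors = 0
    for i in range(len(y)):
        values, counts = np.unique(y[neighbors[i]], return_counts=True)
        predicted = values[np.argmax(counts)]   # majority vote, ties -> smallest label
        errors += int(predicted != y[i])
    return errors / len(y)
```

Applying this function to the raw word counts, to a PCA embedding, and to a learned embedding gives three numbers that can be compared in the manner of Figures 9 and 10.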

Figure 7: Dataset 2. The dataset and labeling remain fixed throughout; no drift occurs.
Shown is the average performance for each method.

Figure 8: Dataset 2. Two possible clusterings of the data exist. For the first half, the
first clustering is used to generate the labels, and in the second half a switch is made to
the second possible clustering. Top: average performance; bottom: an example final
embedding for each method. Note the failure of the batch method (LMNN), and the
poor performance of the nonadaptive method.

We then conducted drift experiments where the clustering changes. The change
happens after the metric learner for the original clustering has converged, hence the
nonadaptive learning rate is effectively zero. For each change, we show the k-NN error
rate in the learned SAOML embedding as it adapts to the new clustering. Emphasizing
the visualization and computational advantages of a low-dimensional embedding, we
computed the k-NN error after projecting the data into the first 5 dimensions of the
embedding. Also shown are the results for a learner where an oracle allows
reinitialization of the metric to the identity at time zero, and for the nonadaptive learner,
for which the learning rate is not increased. Figure 11 (left) shows the results when the
clustering changes from the four-class sentiment + type partition to the two-class
product-type-only partition, and Figure 11 (right) shows the results when the partition
changes from sentiment to product type. In the first case, the similarity of the clusterings
allows SAOML to significantly outperform even the reinitialized method, and in the
second case it remains competitive even though the clusterings are unrelated.
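One way to obtain the "first 5 dimensions" of an embedding from a learned Mahalanobis matrix M is through its leading eigenvectors. The sketch below assumes this eigendecomposition-based construction, which is not spelled out in the text; the function name and the clipping of small negative eigenvalues are likewise illustrative choices.

```python
import numpy as np

def embed_top_dims(X, M, n_dims=5):
    """Project rows of X onto the leading n_dims directions of a PSD metric M.

    With M = U diag(lam) U^T, the embedding is x -> diag(sqrt(lam)) U^T x
    restricted to the n_dims largest eigenvalues, so that squared Euclidean
    distances in the embedding approximate (x - z)^T M (x - z).
    """
    M = (M + M.T) / 2.0                       # guard against round-off asymmetry
    evals, evecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:n_dims]  # indices of the largest eigenvalues
    top_vals = np.clip(evals[order], 0.0, None)
    L = np.sqrt(top_vals)[:, None] * evecs[:, order].T   # shape (n_dims, d)
    return X @ L.T
```

When M has rank at most `n_dims`, which a low-rank-promoting regularizer encourages, the truncated embedding reproduces the learned distances exactly.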


7  Conclusion  and  Future  Work 

We introduced the problem of metric learning in a changing environment, and
presented an efficient, strongly adaptive online algorithm having strong theoretical
performance guarantees. Performance of our algorithms was evaluated both on synthetic
and real datasets, demonstrating the ability of SAOML to learn and adapt quickly in
the presence of changes both in the clustering of interest and in the underlying data
distribution.

Figure 9: Metric learning for product type clustering. Book reviews blue, electronics
reviews red. Original LOO k-NN error rate 15.3%. Left: First two dimensions of
learned SAOML embedding (LOO k-NN error rate 11.3%). Right: embedding from
standard PCA (k-NN error 20.4%).


Figure  10:  Metric  learning  for  sentiment  clustering.  Positive  reviews  blue,  negative 
red.  Original  LOO  k-NN  error  rate  35.7%.  Left:  First  two  dimensions  of  learned 
SAOML  embedding  (LOO  k-NN  error  rate  23.5%).  Right:  embedding  from  standard 
PCA (k-NN error 41.9%).


Potential  directions  for  future  work  include  the  learning  of  more  expressive  metrics 
beyond  the  Mahalanobis  metric,  the  incorporation  of  unlabeled  data  points  in  a  semi- 
supervised  learning  framework,  and  the  incorporation  of  an  active  learning  framework 
to  select  which  pairs  of  data  points  to  obtain  labels  for  at  any  given  time. 

Distance Metric Tracking: Supplementary Material


1.  Online  DML  Dynamic  Regret 

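In this section, we derive the dynamic regret of our COMID metric learning algorithm.
Recall that the COMID algorithm is given by

$$
\begin{aligned}
\hat{M}_{t+1} &= \arg\min_{M \succeq 0} \; B_\psi(M, \hat{M}_t) + \eta_t \langle \nabla_M \ell_t(\hat{M}_t, \hat{\mu}_t),\, M - \hat{M}_t \rangle + \eta_t \rho \|M\|_* \\
\hat{\mu}_{t+1} &= \arg\min_{\mu \ge 1} \; B_\psi(\mu, \hat{\mu}_t) + \eta_t \nabla_\mu \ell_t(\hat{M}_t, \hat{\mu}_t)'(\mu - \hat{\mu}_t),
\end{aligned}
\tag{1}
$$

where $B_\psi$ is any Bregman divergence and $\eta_t$ is the learning rate parameter.
From (Hall & Willett, 2015) we have: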

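Theorem 1. Define

$$
G_\ell = \max_{\theta \in \Theta,\, f \in \mathcal{F}} \|\nabla f(\theta)\|, \qquad
\phi_{\max} = \frac{1}{2} \max_{\theta \in \Theta} \|\nabla \psi(\theta)\|, \qquad
D_{\max} = \max_{\theta, \theta' \in \Theta} B_\psi(\theta' \| \theta).
$$

Let the sequence $\hat{\theta}_t = [\hat{M}_t, \hat{\mu}_t]$, $t = 1, \dots, T$, be generated
via the COMID algorithm, and let $w = \{\theta_t\}$ be an arbitrary sequence in
$W = \{w \mid \sum_t \|\theta_{t+1} - \theta_t\| \le \gamma\}$. Then using
$\eta_{t+1} \le \eta_t$ gives

$$
R_T(\{\theta_t\}) \le \frac{D_{\max}}{\eta_{T+1}} + \frac{4\phi_{\max}}{\eta_T}\,\gamma + \frac{G_\ell^2}{2\sigma} \sum_{t=1}^{T} \eta_t
\tag{2}
$$

for any such sequence $\{\theta_t\}$.

Using a decaying learning rate $\eta_t$, we can then prove a bound on the dynamic regret
for a quite general set of stochastic optimization problems.

Applying this to our problem, we obtain the following. Assume a fixed $\mu$. Then for the
estimation of $M_t$ we have

$$
G_\ell = \max_{\|M\| \le c,\, t,\, \mu} \|\nabla(\ell_t(M, \mu) + \rho \|M\|_*)\|_2, \qquad
\phi_{\max} = \frac{1}{2} \max_{\|M\| \le c} \|\nabla \psi(M)\|_2, \qquad
D_{\max} = \max_{\|M\|, \|M'\| \le c} B_\psi(M' \| M).
$$

For $\ell_t(\cdot)$ being the hinge loss and $\psi = \|\cdot\|_F^2$,

$$
G_\ell \le \sqrt{\left(\max_t d^2(x_t, z_t) + \rho\right)^2}, \qquad
\phi_{\max} = c\sqrt{n}, \qquad
D_{\max} = 2c\sqrt{n},
$$

where $d(x, z) = \|x - z\|_2$ is the standard Euclidean distance. The other two quantities
are guaranteed to exist and depend on the choice of Bregman divergence and $c$. Thus,

Corollary 1 (Dynamic Regret: ML COMID). Let the sequence $\hat{M}_t, \hat{\mu}_t$ be
generated by (1), and let $\{M_t\}$ be an arbitrary sequence with $\|M_t\| \le c$ and
$\sum_t \|M_{t+1} - M_t\|_F \le \gamma$. Then using $\eta_{t+1} \le \eta_t$ gives

$$
R_T(\{M_t\}) \le \frac{D_{\max}}{\eta_{T+1}} + \frac{4\phi_{\max}}{\eta_T}\,\gamma + \frac{G_\ell^2}{2\sigma} \sum_{t=1}^{T} \eta_t
\tag{3}
$$

and setting $\eta_t = \eta_0/\sqrt{T}$,

$$
R_T(\{M_t\}) \le \sqrt{T} \left( \frac{D_{\max}}{\eta_0} + \frac{4\phi_{\max}\,\gamma}{\eta_0} + \frac{\eta_0 G_\ell^2}{2\sigma} \right)
= O\!\left( \sqrt{T} \left[ 1 + \sum_t \|M_{t+1} - M_t\|_F \right] \right).
\tag{4}
$$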
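For the hinge loss and a Frobenius-type $\psi$ as considered above, one step of the update (1) admits a closed form: a gradient step followed by eigenvalue shrinkage and projection onto the positive semidefinite cone. The sketch below is illustrative only and rests on several assumptions that are not stated in this section: the loss is taken to be $\ell_t(M, \mu) = \max(0, 1 - y_t(\mu - (x_t - z_t)' M (x_t - z_t)))$ with $y_t = \pm 1$, the Bregman divergence is generated by $\psi = \frac{1}{2}\|\cdot\|_F^2$ (so any constant factor is absorbed into $\eta_t$), and the function name is hypothetical.

```python
import numpy as np

def comid_step(M, mu, x, z, y, eta, rho):
    """One COMID-style update for a pair (x, z) with label y (+1 similar, -1 dissimilar).

    Assumed loss: max(0, 1 - y * (mu - (x - z)^T M (x - z))).
    """
    u = x - z
    margin = y * (mu - u @ M @ u)
    if margin < 1.0:                       # hinge is active: nonzero subgradient
        grad_M = y * np.outer(u, u)
        grad_mu = -float(y)
    else:                                  # hinge inactive: only the regularizer acts
        grad_M = np.zeros_like(M)
        grad_mu = 0.0
    # Gradient step, then proximal step for eta * rho * ||M||_* over the PSD cone:
    # shrink every eigenvalue by eta * rho and clip at zero.
    A = M - eta * grad_M
    evals, evecs = np.linalg.eigh((A + A.T) / 2.0)
    evals = np.maximum(evals - eta * rho, 0.0)
    M_next = (evecs * evals) @ evecs.T
    # Threshold update, projected onto the constraint mu >= 1.
    mu_next = max(1.0, mu - eta * grad_mu)
    return M_next, mu_next
```

The shrinkage step is what drives small eigenvalues of the metric to zero, which is consistent with the low-rank embeddings discussed in the experiments.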


Corollary 1 is a bound on the regret relative to the batch estimate of $M_t$ that
minimizes the total batch loss subject to a bounded variation
$\sum_t \|M_{t+1} - M_t\|_F$. Furthermore, $\eta_t = \eta_0/\sqrt{t}$ gives the same
bound as (4).

In other words, we pay a linear penalty on the total amount of variation in the
underlying parameter sequence. From (4), it can be seen that the bound-minimizing
$\eta_0$ increases with increasing $\sum_t \|M_{t+1} - M_t\|_F$, indicating the need
for an adaptive learning rate.
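To make this dependence explicit, the following short calculation minimizes a bound of the form appearing in (4) over $\eta_0$. Here $a > 0$ stands for the constant multiplying the variation term (it is proportional to $\phi_{\max}$) and $\gamma$ for the total variation $\sum_t \|M_{t+1} - M_t\|_F$; both are shorthand introduced only for this illustration.

```latex
\[
g(\eta_0) = \frac{D_{\max} + a\,\gamma}{\eta_0} + \frac{\eta_0\, G_\ell^2}{2\sigma},
\qquad
g'(\eta_0) = 0
\;\Longrightarrow\;
\eta_0^\star = \sqrt{\frac{2\sigma\,(D_{\max} + a\,\gamma)}{G_\ell^2}},
\qquad
g(\eta_0^\star) = G_\ell \sqrt{\frac{2\,(D_{\max} + a\,\gamma)}{\sigma}} .
\]
```

The minimizer $\eta_0^\star$ grows like $\sqrt{\gamma}$, and for $\gamma = 0$ the resulting bound $\sqrt{T}\, g(\eta_0^\star)$ reduces to the static rate $G_\ell (2 T D_{\max}/\sigma)^{1/2}$.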

For comparison, if the metric is in fact static then by standard stochastic mirror
descent results (Hall & Willett, 2015) we have

Theorem 2 (Static Regret). If $\hat{M}_1 = 0$ and
$\eta_t = (2\sigma D_{\max})^{1/2} / (G_\ell \sqrt{T})$, then

$$
R_T(\{M_t\}) \le G_\ell \,(2 T D_{\max}/\sigma)^{1/2}.
\tag{5}
$$


2. Strongly Adaptive Regret

The following theorem is from (Daniely et al., 2015), slightly modified to accommodate
our different definition of strongly adaptive regret.

Theorem 3. Fix a set $W$ and choose an algorithm $B$ such that

$$
R_B(T) \le C T^{\alpha}
\tag{6}
$$

for all $T > 0$ and some constants $\alpha \in (0, 1)$, $C > 0$. Then the strongly
adaptive online learner $\mathrm{SAOL}^B$ using $B$ as the black box learners satisfies

$$
R_{\mathrm{SAOL}^B}(I) \le \frac{4}{2^{\alpha} - 1}\, C |I|^{\alpha} + 40 \log(s + 1)\, |I|^{1/2}
\tag{7}
$$

for every interval $I = [q, s]$. In particular, $\mathrm{SAOL}^B$ will be strongly
adaptive if $\alpha \ge \frac{1}{2}$ and $B$ has low regret.

We now apply this result to the low dynamic regret of mirror descent established in (4)
of Corollary 1.

From Corollary 1, COMID with $\eta_t = \eta_0/\sqrt{T}$ satisfies the black-box learner
condition (6) with $\alpha = 1/2$. Hence, to apply Theorem 3 to SAOML, it remains to
normalize the loss function to between 0 and 1.

As noted in Corollary 1, it is reasonable to assume that $\|M\| \le c$. Hence the loss
function is bounded by
$\ell_t(M_t, \mu_t) \le k = \ell(c \max_t \|x_t - z_t\|_2^2)$ and can be normalized to
the appropriate range. We thus have

Theorem 4 (SAOML). Let $W = \{w \mid \sum_t \|\theta_{t+1} - \theta_t\| \le \gamma\}$
and $B$ be the COMID algorithm of (1) with $\eta_t(I) = \eta_0/\sqrt{|I|}$ and fixed
$\mu$. Then the strongly adaptive online learner $\mathrm{SAOL}^B$ using $B$ as the
black box learners satisfies

$$
R_{\mathrm{SAOL}^B}(I) \le \frac{4}{2^{1/2} - 1}\, C (1 + \gamma) |I|^{1/2} + 40 \log(s + 1)\, |I|^{1/2}
\tag{8}
$$

for some constant $C$ and every interval $I = [q, s]$. In particular, $\mathrm{SAOL}^B$
is strongly adaptive.

We  note  that  this  bound  is  stronger  than  those  considered 
in  (Daniely  et  al.,  2015)  as  it  incorporates  dynamic  regret 
in  the  definition  of  strongly  adaptive  regret. 